AITopics | context size

Collaborating Authors

context size

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

A Hierarchical Language Model with Predictable Scaling Laws and Provable Benefits of Reasoning

Gaitonde, Jason, Koehler, Frederic, Mossel, Elchanan, Shin, Joonhyung, Sly, Allan

arXiv.org Machine LearningMay-14-2026

We introduce a family of synthetic languages with hierarchical structure -- generated by a broadcast process on trees -- for which the role of context length and reasoning in autoregressive generation can be analyzed precisely. At the heart of our analytic approach is an \emph{exact $k$-gram ansatz} in place of transformers with context length $k$, a substitution we then validate empirically. Using this ansatz we derive explicit asymptotic predictions for distributional statistics of the sequences produced by a trained model, instantiated in two settings. For the \emph{Ising broadcast process} (a soft-constrained language), we prove that the variance of the generated sum scales log-linearly in the context depth and its kurtosis converges to that of a Gaussian -- both deviating from the true language for any sublinear context. For the \emph{coloring broadcast process} (a hard-constrained language) in the freezing regime, bounded-context autoregression produces sequences that, with high probability, are inconsistent with \emph{any} valid coloring of the underlying tree. Together these results imply an $Ω(n)$ lower bound on the context length required to faithfully sample length-$n$ sequences. In contrast, we prove that an autoregressive \emph{reasoning} model with only $Θ(\log n)$ working memory can sample exactly from the true language -- an exponential improvement. We confirm both the lower-bound predictions and the reasoning-based upper bound empirically with transformers trained on the synthetic language; the trained models track our asymptotic predictions quantitatively across a wide range of context sizes.

large language model, machine learning, natural language, (21 more...)

arXiv.org Machine Learning

2605.13687

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Add feedback

Slagle - 2024 - SpaceByte: Towards Deleting Tokenization from Large Language Modeling

Kevin Slagle

Neural Information Processing SystemsFeb-18-2026, 10:43:50 GMT

In this work, we study the performance of byte-level and subword-level autoregressive models when trained using a fixed compute budget.

large language model, machine learning, spacebyte, (22 more...)

Neural Information Processing Systems

Country:

Asia > Singapore (0.04)
North America > United States > Washington > King County > Seattle (0.04)
North America > Dominican Republic (0.04)
(2 more...)

Genre: Research Report > Experimental Study (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.65)

Add feedback

Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers

Neural Information Processing SystemsFeb-17-2026, 04:23:39 GMT

Despite several works trying to reduce their computational cost, most of LLMs still adopt attention layers between all pairs of tokens in the sequence, thus incurring a quadratic cost. In this study, we present a novel approach that dynamically prunes contextual information while preserving the model's expressiveness, resulting in reduced memory and computational

arxiv preprint arxiv, large language model, machine learning, (16 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Lebanon (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Oregon > Linn County > Lebanon (0.04)
(7 more...)

Genre:

Research Report > Promising Solution (0.87)
Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Appendix of Deep Stochastic Processes via Functional Markov Transition Operator A Proofs

Neural Information Processing SystemsFeb-14-2026, 22:53:00 GMT

A.1 Proof of proposition 4.1 (See page 4) A.2 Proof of proposition 4.2 (See page 4) MTOs in the form of Equation ( 9) are consistent and exchangeable. MTO s are consistent and exchangeable for the general form. These convex functions are then randomly shifted and rescaled to increase diversity. To circumvent memory issues, we use deep sets in this instance. Note that we do not share parameters among iterations.

artificial intelligence, equation, machine learning, (15 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)

Add feedback

Neural Code Comprehension: A Learnable Representation of Code Semantics

Tal Ben-Nun, Alice Shoshana Jakobovits, Torsten Hoefler

Neural Information Processing SystemsFeb-12-2026, 08:11:19 GMT

Neural Information Processing Systems http://nips.cc/

international conference, proceedings, representation, (12 more...)

Neural Information Processing Systems

Country:

Europe > Switzerland > Zürich > Zürich (0.05)
North America > United States > District of Columbia > Washington (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
(4 more...)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

Add feedback

4e85362c02172c0c6567ce593122d31c-Paper-Conference.pdf

Neural Information Processing SystemsFeb-11-2026, 15:47:09 GMT

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(11 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Decoupled-Value Attention for Prior-Data Fitted Networks: GP Inference for Physical Equations

Sharma, Kaustubh, Singh, Simardeep, Pareek, Parikshit

arXiv.org Artificial IntelligenceDec-2-2025

Prior-data fitted networks (PFNs) are a promising alternative to time-consuming Gaussian process (GP) inference for creating fast surrogates of physical systems. PFN reduces the computational burden of GP-training by replacing Bayesian inference in GP with a single forward pass of a learned prediction model. We introduce Decoupled-V alue Attention (DV A)- motivated by the GP property that the function space is fully characterized by the kernel over inputs and the predictive mean is a weighted sum of training targets. DV A computes similarities from inputs only and propagates labels solely through values. Thus, the proposed DV A mirrors the GP update while remaining kernel-free. We demonstrate that PFNs are backbone architecture invariant and the crucial factor for scaling PFNs is the attention rule rather than the architecture itself. Specifically, our results demonstrate that (a) localized attention consistently reduces out-of-sample validation loss in PFNs across different dimensional settings, with validation loss reduced by more than 50% in five-and ten-dimensional cases, and (b) the role of attention is more decisive than the choice of backbone architecture, showing that CNN, RNN and LSTM-based PFNs can perform at par with their Transformer-based counterparts. Bayesian inference provides a powerful framework for reasoning under uncertainty, with methods like Gaussian processes (GPs) offering well-calibrated predictions and principled uncertainty estimates (Williams & Rasmussen, 2006). However, the practical application of these methods is often hindered by the heavy computational burden of learning kernel hyperparameters. For example, exact GP inference scales cubically with the number of data points, making its deployment infeasible for large datasets or problems requiring repeated training. Consider a physical system where a surrogate GP is chosen due to its uncertainty estimates and differentiable closed-form expressions. However, the underlying input dataset and configuration changes frequently, and the surrogate is supposed to work for these new, previously unseen variations. In such conditions, GP needs to be trained repeatedly, incurring significant computing cost, each time the dataset changes.

artificial intelligence, machine learning, pfn, (19 more...)

arXiv.org Artificial Intelligence

2509.2095

Country:

Asia > India (0.28)
North America (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Energy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Predicting the Formation of Induction Heads

Aoyama, Tatsuya, Wilcox, Ethan Gotlieb, Schneider, Nathan

arXiv.org Artificial IntelligenceNov-24-2025

Arguably, specialized attention heads dubbed induction heads (IHs) underlie the remarkable in-context learning (ICL) capabilities of modern language models (LMs); yet, a precise characterization of their formation remains unclear. In this study, we investigate the relationship between statistical properties of training data (for both natural and synthetic data) and IH formation. We show that (1) a simple equation combining batch size and context size predicts the point at which IHs form; (2) surface bigram repetition frequency and reliability strongly affect the formation of IHs, and we find a precise Pareto frontier in terms of these two values; and (3) local dependency with high bigram repetition frequency and reliability is sufficient for IH formation, but when the frequency and reliability are low, categoriality and the shape of the marginal distribution matter.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.16893

Country: Europe > Austria (0.28)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.47)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)

Add feedback

Neural Code Comprehension: A Learnable Representation of Code Semantics

Tal Ben-Nun, Alice Shoshana Jakobovits, Torsten Hoefler

Neural Information Processing SystemsNov-20-2025, 14:43:35 GMT

Sifting through, categorizing, and understanding code thus becomes an essential task for a variety of fields.

machine learning, natural language, programming language, (16 more...)

Neural Information Processing Systems

Country:

Europe > Switzerland > Zürich > Zürich (0.05)
North America > United States > District of Columbia > Washington (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
(4 more...)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

Add feedback

The Markovian Thinker: Architecture-Agnostic Linear Scaling of Reasoning

Aghajohari, Milad, Chitsaz, Kamran, Kazemnejad, Amirhossein, Chandar, Sarath, Sordoni, Alessandro, Courville, Aaron, Reddy, Siva

arXiv.org Artificial IntelligenceNov-14-2025

Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard RL "thinking environment", where the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence this yields linear compute with constant memory. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write a textual state near the end of each chunk sufficient for seamless continuation of reasoning after reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate at 96K average thinking length LongCoT-RL costs 27 H100-months vs. 7 for Delethink. Analysis at RL initialization shows off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2510.06557

Country: North America > Canada > Quebec (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback